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Abstract 

In this paper, we consider the Integrated Completed Likelihood (ICL) as a useful criterion for 
estimating the number of changes in the underlying distribution of data in problems where detecting the 
precise location of these changes is the main goal. The exact computation of the ICL requires 0(Kn 2 ) 
operations (with K the number of segments and n the number of data-points) which is prohibitive in 
many practical situations with large sequences of data. We describe a framework to estimate the ICL 
with O(Kn) complexity. Our approach is general in the sense that it can accommodate any given model 
distribution. We checked the run-time and validity of our approach on simulated data and demonstrate 
its good performance when analyzing real Next-Generation Sequencing (NGS) data using a negative 
binomial model. 
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1 Introduction 

The estimation of the number of segments is a central aspect in change-point methodology. For instance, 
in the context of CGH-array or Next-Generation Sequencing experiments, identifying the number and 
corresponding location of segments is crucial as the segments may relate to a biological event of inter- 
est. This theoretically complex problem can be handled in the more general context of model selection, 
leading to the use of ad hoc procedures in practical situations. 

Among the procedures are the use of classical criteria based on penalized likelihoods such as the 
Akaike Information Criterion (AIC) and the Bayes Information Criterion (BIC or SIC, see Yao, 1988). 
However, when choosing the number of segments, the BIC criterion uses a Laplace approximation re- 
quiring differentiability conditions not satisfied by the model, which thus may not be appropriate when 
the number of observations in each segment are unequal and unknown. These criteria also tend to over- 
estimate the number of segments as the clustering within segments tends to be ignored, as shown by 
Zhang and Siegmund (2007) who proposed a modified BIC criterion using a Brownian motion model 
with changing drift for the specific case of normal data. 

For this reason, there has been an extensive literature influenced by Birge and Massart (2001) which 
proposes new penalty shapes and constants in order to select a lower number of segments in the profile. 
The idea is to choose the model that, within a set of models, performs closest to the true value by deriving a 
tight upper bound on the variance term. This leads to penalties that generally depend only on the number 
of segments K, and whose constants can be chosen adaptively to the data (Lavielle, 2005; Lebarbier, 
2005). However, a large proportion of those methods are developed specifically for normal data, and are 
not applicable to count datasets modeled by the Poisson or the negative binomial distributions. 

Other approaches for model selection appearing in the literature include sequential likelihood ratio 
tests (Haccou and Meelis, 1988) and Bayesian approaches through estimating model posterior proba- 
bilities by various MCMC methods (Green, 1995; Chib, 1998; Andrieu et al., 2001; Fearnhead, 2005). 
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However, the Bayesian approaches are often computationally intensive as they require re-sampling. 

In the context of incomplete data models (e.g. mixture model for clustering) Biernacki et al. (2000) 
proposed a model selection criterion accounting for both observed and unobserved variables based on the 
Integrated Completed Likelihood (ICL): J2s ^(S\X) logP(5|X) where X are the observations and S 
are the corresponding (unknown) clustering membership. 

Rigaill et al. (2010) proposed the use of the ICL criterion in the multiple change-point detection 
context. Hence, the segmentation S can be considered as an unobserved variable in the sense that the 
segment-labels of each datapoint are not known. In this context, we can select the number of segments 
as: 

K = argrmnlCL(lT) where \CL(K) = - \ogV{X\K) + H(K), (1) 

and U(K) = - Y^seM p ( s \ x > K ) logP(5|X, K), M K representing the set of all segmentations of 
the signal in K segments. 

The entropy term H(K) can be viewed as an intrinsic penalty to quantify the reliability of a given 
model with K segments. In the segmentation context, an exact algorithm with 0(Kn 2 ) complexity 
computes the ICL. In a simulation study, Rigaill et al. (2010) showed that the ICL performed better 
than standard Bayesian model selection criteria such as BIC or Deviance Information Criterion (DIC). 
However the quadratic complexity restricts the use of this Bayesian ICL to relatively small profiles. 

In this paper we suggest a fast procedure to both compute an estimate of the ICL criterion with linear 
complexity and apply it to estimate the number of segments within a set of change-point data. Section 
2 describes the ICL estimation procedure, through the use of a constrained hidden Markov model and 
section 3 validates the approach by presenting the results of different simulations for detecting the correct 
number of change-points. Finally, Section 4 is a discussion of our method supported by the comparison 
of a few segmentation algorithms on data-sets simulated by re-sampling real RNA-seq data. 
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2 Integrated Completed Likelihood criterion estimation using HMM 

In this paper we use the following segment-based model for the distribution of X given a segmentation 

S E M K - 

n K 

P(X\S- 1 9) = Y[g eSz (X t ) = l[ JJ g Bk (X t ) (2) 

i=l k=li,Si = k 

where gg s . (•) is the parametric distribution (ex: normal, Poisson, negative binomial, etc.) with parameter 
6 s ( , = . . . , 6k) is the global parameter, Si G {1, . . . , K} is the index of the segment at position i 
(ex: Si-,5 — 11222 corresponds to a segmentation of n = 5 points into K = 2 segments with a change- 
point occurring between positions 2 and 3), and Mk is the set of all possible partitions of Si, . . . , S n 
with a fixed K number of segments, such that Si = 1 and S„ = K, and Si — Si-i 6 {0, 1} for all 
i = 2, . . . , n. 

2.1 ICL estimation using the entropy 

We propose the following approach for fast segmentation and model selection in change-point problems. 
The ICL expression in Equation (1) can also be approximated as the Bayesian Information Criterion 
(BIC) with an entropy term, such that 

ICL hmm (A-) = BIC(JQ + H(K), (3) 

with K being the number of segments. 

The entropy H(K) can be viewed as a penalty term that characterizes the separation of the observa- 
tions in different segments. In other words, the larger the differences between the best segmentation and 
others in K segments, the lower H(K). For fixed K segments, the entropy H(K) thus will be lower 
when the best segmentation provides a much better fit compared to other segmentations with the same 
number of segments. Thus, in addition to penalizing for additional parameters through the BIC, the ICL 
criterion also favors models which provide the most evidence of a clustering effect within the detected 
segments. 
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As will be described in Section 2.2, ¥(S\X, K; 6k) corresponds to a heterogeneous Markov chain 
whose starting distributions \x\ and i th transition i\i can both be recursively computed. We hence express 
the entropy term H(K): 

U(K) = -J2 F ( S \ X > K > $k) log nS\X, K, 9 K ) 
s 

n / n \ 

= -^2» 1 (S 1 )l[n(Si-i,S i ) logM 1 (S , i) + ^log7r i (S' i _ 1 ,S i ) 

S i=2 \ i=2 / 

where Ml (5i) = nSi\X,K,6 K ) and for * ^ 2, 7* = F(Sj|<Si_i, X, K, 6k)- 

With some simple computations, we obtain the following expression (Hernando et al., 2005): 



U(K) 



fii(Si) log fJ,i(Si) ^ ^j-ifSj-i^fSj-i^^logTr^Si-i,.^ 



(4) 



Si i=2 S < _i,S 4 

where for all i, fii(Si) = ¥(Si\X, K, 6k) can be recursively obtained for all 2 ^ i ^ n through 
Hi(Si) = J2s- 1 Mi-i ( < S'i-i) 7r i('S'i-i: Section 2.2 describes a fast method for obtaining these poste- 
rior quantities of interest. 

The entropy H(K) expression includes posterior probabilities, thus requiring the estimation of the 
posterior distribution of S. With quadratic complexity 0(Kn 2 ) (Rigaill et al., 2010), its exact computation 
is usually intractable for large datasets of tens of thousands points or more. We describe a frequentist 
framework for estimating, with linear complexity O(Kn), the posterior probabilities in both the BIC and 
entropy expressions required to estimate the ICL. 

2.2 Fast estimation of posterior quantities in ICL criterion 

We introduce a constrained hidden Markov model (Luong et al., 2012) that corresponds exactly to a 
segmentation model, where the change-points separate segments consisting of contiguous observations 
with the same distribution. 

Introducing a prior distribution P(5) on any S 6 Mk, we obtain a posterior distribution of the 
segmentation: 

ns\x-o) nx\s-,6)ns) 



For a uniform prior, set = Qj-lJ = |A^k|. 

Let us assume that S is a heterogeneous Markov chain over {1, 2, . . . , K, K + 1}. We only allow for 
transitions of or +1 by constraining the chain with: 

P(5i = 1) = 1 

V2<^n, VI O = k }^ = \ =1 7.? fc(i) 

[ P(5j = fc + = ft) = r) k {i) 

To estimate the properties of the K th state we introduce a 'junk' state K + 1, and for consistency we 
choose P(5j = if + l|<Si-i = K + 1) = 1. Our frequentist approach estimates the emission distribution 
by using the maximum likelihood estimate g§ {xi), or alternatively the E-M algorithm. We also may 
specify a homogeneous HMM by setting rjk (i) = rj for all k = 1, . . . , K. 

We define the forward and backward quantities as follows for observation i and state k: For 1 ^ i ^ 
n — 1: 

F i {k)=¥(X 1:i = x 1:i ,S i = k) 

Bi{k) = P(X i+ i : „ = x i+Un , S n = K\Si = k). 

We may use the following recursions to estimate the forward and backward quantities: 

Fl(k) = f9eM) if fc = 1 
else 



Fi(k) = - »j fc (i)) + l fc >iJ!i_i(A: - l)Tj fc (*)] g §k {xi) 

{Vk ( n )g§ k ( x n) if k = K - 1 

(1 - r)K{n))g §k (x n ) ifk = K 
else 

Bi-i(k) = (1 - r)k(i))gg k {xi)Bi(k) + l k<K Vk+i(i)g§ k+1 (x l )B i (k + 1) 

These quantities can then be used to obtain the marginal distributions and the transition 7T; defined 
above with the entropy H(K) with: 
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7Ti(fc, k') 



F t {k)B l {k) 
Fi(l)Bx(l) 

F(S i = k>\S l - 1 =k)g §k (x l )B l (k') 



(6) 



(7) 



where 



P(5i = = fc) 



n 
o 



i 



77 if &' 
if A' 
else 



= & 



k + 1 



2.3 Model selection procedure using ICL 

In this constrained HMM method, we first identify a set of change-points for a fixed number of seg- 
ments K through a fast method. One such example for normally distributed data is a K-means algo- 
rithm (Hartigan and Wong, 1979; Comte and Rozenholc, 2004), which is a greedy method that mini- 
mizes the least-squares criterion. We also used a pruned dynamic programming algorithm (PDPA, see 
Rigaill, 2010) to obtain an initial set of change-points, which is a fast algorithm to compute the op- 
timal segmentation according to loss functions including Poisson, negative binomial or normal losses. 
After obtaining this segmentation to obtain the maximum likelihood estimates for the parameters, we 
use the Viterbi algorithm to obtain the most probable set of change-point as an initial segmentation 
for the constrained HMM, and estimate the model selection criteria using the frequentist ICL using 
HMM (ICLj lmm ) for K number of segments. To estimate the ICLj lmm (i ; i') of a change-point model 
with K segments, we estimate posterior probabilities of interest through the forward-backward algo- 
rithm as previously described, which is implemented in the postCP package available on CRAN (http : 
//cran.r-project. org/ web /packages /post CP). 

This procedure is repeated for a range of possible values of K: K range = {K m im ■ ■ ■ , K max }. We 
estimate the number of correct segments by minimizing the ICL of the best identified segmentation with 
number of segments K, as a function of K . For the ICL, this can be expressed as 



%X = argmin lCL hmm {K). 



(8) 



K 
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3 Validation 

To validate the quality of our approach we evaluated the results of our algorithm on two sets of simula- 
tions. The first consisted of relatively small signals (n = 500 points) on which we compared our approach 
to the quadratic Bayesian algorithm. On the second, with larger signals (n = 10000), we only ran the 
ICLhmm algorithm due to computing limitations. 
The simulation designs were as follows: 

Small design. We used a similar simulation design as that suggested in Rigaill et al. (2010): we simu- 
lated a sequence of 500 observations from a Poisson model (requiring the choice of only one parameter) 
affected by six change-points at the following positions: 22, 65, 108, 219, 252 and 435. Odd segments 
had a mean of 1, while even segments had a mean of 1 + A, with lambda varying from to 10; the 
true number of change-points being more easily identified with higher As. For each configuration, we 
simulated 350 sequences. 

Large design. We repeated the preceding procedure for large-scale datasets. We generated a sequence 
of 10000 observations with K = 25 segments by randomly selecting 24 change-points whose locations 
were drawn from a uniform distribution (without replacement), with each segment needing to be at least 
of length 25. For this sample size, we focused on the results from the ICLhmm as the Bayesian ICL 
implementation is not fast enough to be practical in this situation. For each configuration, we simulated 
200 sequences. 

We compared the performances of three different criteria: 

• The ICLhmm greedy criterion as described in the previous section and given by equation (8), with 
initial parameters obtained by the greedy algorithm using least-squares. 

• The ICLhmm exact criterion. The criterion is the same as in the ICLhmm greedy approach, but 
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Figure 1 : Performance of model selection criteria, n — 500 

the initial parameters are given by the Pruned Dynamic Programming Algorithm with Poisson loss. 

• The Bayesian ICL criterion as described in (Rigaill et al., 2010). The hyper-parameters used for 
the prior on the data-distribution were set to 2. 

Figure 1 summarizes the results of the simulation study for simulations of length 500. While the 
Bayesian ICL criterion had the highest amount of correct segment estimates K, the faster ICLhmm with 
pruned PDPA performed almost as well. Of note, the average run-times of the methods were 9.7 seconds 
for the Bayesian approach, 1.2 seconds for the HMM with the greedy segmentation, and 3.8 seconds for 
the HMM with PDPA. 

Figure 2 summarizes the results of the simulation study for simulations of length 10000. For these 
larger sequences, the ICL criteria performed much better when the initial change-point set was detected 
by PDPA than with the greedy algorithm. As the segmentation problem becomes more difficult with 
more segments, the greedy algorithm is less successful in providing accurate initial change-point location 
estimates. As a result, less accurate values of 8 are used and ICLhmm is not as effective in predicting 
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Figure 2: Performance of model selection criteria, n = 10000 

the number of segments as in the smaller sample size example. 

On the other hand, the ICL combined with PDPA detected the correct number of segments more than 
80% of the time with larger intersegmental differences of A > 2. The average run-time for the model 
selection was 1 1.4 seconds for the HMM with the greedy segmentation, and 59.3 seconds for the HMM 
with PDPA. Despite the longer run-time, it is advised to use the PDPA for model selection in very long 
sequences as it provides a more accurate set of change-points than greedy methods. 

4 Discussion 

4.1 Re-sampling of negative Binomial data simulation 

To assess the quality of our criteria, we performed the following simulation study to compare two pre- 
viously published packages on CRAN, segclust (Picard et al., 2007) and DNAcopy (Venkatraman and 
Olshen, 2007), with our model selection method with the ICL criterion. We performed the following 
re-sampling procedure using real RNA-seq data. The original data, from a study by the Sherlock Ge- 
nomics laboratory at Stanford University, is publicly available on the NCBI's Sequence Read Archive 
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(SRA, http : / / www . ncbi . nlm . nih . gov/sra) with the accession number SRA048710. We clus- 
tered the observed signal into the following classes: intronic region, low expressed, medium expressed 
and highly expressed genes that we will refer to as levels 1, 2, 3 and 4. We then designed four simulation 
studies, each repeated 100 times, by varying the number and level of segments as well as the signal and 
segment sizes, as described in Figures 3(a) through 6(a). On each segment, the data was obtained by 
re-sampling (with replacement) the observations in the classified clusters. 

To assess the performance of the ICLhm m approach in segmentation, we used 3 different distribu- 
tions as the emission distribution ge(-) a normal distribution (postCP-N), a Poisson distribution (postCP- 
P) and negative binomial (postCP-NB). For segclust, DNAcopy, and postCP-N, which assume a normal 
distribution, we applied the methods to the data after the widely used log(.x + 1) transformation. For the 
postCP procedures we used PDPA to obtain the initial segmentation. 

In the simplest case, Figure 3(a) illustrates the resampling scheme for n = 1, 000 and K = 10 evenly 
spaced segments, displaying the levels used for each segment and the change-point locations. Figure 3(b) 
displays a boxplot of the number of segments found by each approach. In this quite unrealistic scenario, 
postCP-BN estimated the correct number of segments in 63 of 100 replicates. The next best algorithms 
were postCP-N and DNAcopy, respectively, which both slightly underestimated the number of segments. 
The segclust procedure provided a consistent underestimate of the number of segments, while postCP-P 
grossly overrated the number of segments in all replicates and the results are not displayed here. 

Figures 4(a) and 4(b) illustrates the re-sampling schemes and boxplots for a slightly different and 
more realistic scenario of n = 1, 000 and K = 10, with unevenly spaced segments this time. The results 
are comparable to the previous except that the methods performed slightly worse; the median postCP-NB 
estimate was still correct but missed 1 or 2 segments in 43 replicates. This suggests that postCP has more 
difficulties in detecting small segments. 

We then replicated the methods for larger data sets and unevenly spaced segments. Figures 5(a) and 
5(b) display the methods and results for a n = 5, 000 and K = 10 scenario. In this case, DNAcopy 
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(a) Re-sampling schema 



(b) Boxplot of number of segments K 



Figure 3: n = 1 , 000 observations and K = 10 equal segments, (a) Re-sampling schema displaying levels 
and lengths of segments, (b) Boxplot of estimated number of segments K for four different segmentation 
procedures. 



(a) Re-sampling schema 



(b) Boxplot of number of segments K 



Figure 4: n = 1, 000 observations and K = 10 uneven segments, (a) Re-sampling schema displaying lev- 
els and lengths of segments, (b) Boxplot of estimated number of segments K for 4 different segmentation 
procedures. 
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(a) Re-sampling schema (b) Boxplot of number of segments K 

Figure 5: n = 5, 000 observations and A' = 10 uneven segments, (a) Re-sampling schema displaying lev- 
els and lengths of segments, (b) Boxplot of estimated number of segments K for 4 different segmentation 
procedures. 

performs best, with the median number of estimated segments being correct. The postCP-NB method 
gave similar results but missed two change-points in 66 of the replicates. The segclust algorithm, once 
again, found consistent but overly conservative estimates of the number of segments, while postCP-N 
grossly overestimated the segments as the log-transformation was not adequate in this design. We took a 
closer look at the optimal segmentations w.r.t. to negative binomial likelihood in K = 10 segments. We 
found that in 48 replicates out of 100 this segmentation did not include the second simulated segment. 
This suggest that, at least in these 48 replicates, precisely recovering the position of the first two changes 
might be difficult. Thus by selecting K = 8 change-points rather than 10, postCP-NB is coherent with 
the goal of the ICL (i.e. selecting a number of segments such that we are confident in the position of these 
changes). 

In a n = 10, 000 and K = 20 scenario with uneven segments (Figures 6(a) and 6(b)), DNAcopy was 
again best, with postCP-N and postCP-NB almost as effective, the former method slightly underestimat- 
ing the number of segments and the latter approach slightly overestimating them. 

In the investigated scenarios, we found postCP, when the correct negative binomial distribution was 
specified, provided accurate and precise results when segments were evenly spaced, but provided slightly 
less accurate results in more realistic scenarios where segment lengths were uneven. The results with 
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(a) Re-sampling schema (b) Boxplot of number of segments K 

Figure 6: n = 10, 000 observations and K = 20 uneven segments, (a) Re-sampling schema displaying 
levels and lengths of segments, (b) Boxplot of estimated number of segments K for 4 different segmen- 
tation procedures. 

postCP-N and postCP-P suggest that the postCP approach may be susceptible to misspecification of the 
emission distribution, particularly when there are very small segments present (Figure 5(b)). 

On the other hand, DNAcopy tended to underestimate segments in easier scenarios, where segments 
where even, but obtained more accurate results with more realistic uneven segments. The hybrid seg- 
mentation and clustering approach, segclust, generally was consistent but underestimated the number of 
segments. 

4.2 Conclusion 

We describe a fast procedure for estimating the ICL criterion in the context of model selection for seg- 
mentation. While simulations showed that the performance of the ICLhmm approach was almost as 
good as that of the exact Bayesian approach, several features allow for its use in a wider range of appli- 
cations. The described ICLhmm algorithm is more versatile as it can be applied to data of any model 
distribution provided we can initialize the HMM, through either maximum likelihood estimation or the 
expectation-maximization (E-M) algorithm. It thus provides a model selection criterion in situations with 
limited options for segmentation, for example when data have Poisson or negative binomial distributions. 
Furthermore, the procedure can be applied to longer signals due to its faster run-time. With its effective 
results in finding the number of segments, specifically those where the precise location of the change- 
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points can be estimated, this paper shows the practicality of the ICLhmm procedure in a wide variety of 
segmentation problems. Its implementation will very shortly be available in the package postCP on the 
CRAN. 
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